Date Objects

import datetime as dt
import pandas as pd
import dateutil as du

Though economic and financial time series data usually come neatly labeled with timestamps and require little extra manipulation, we will still cover the basics of the datetime library in this chapter. It will be useful when working with commercial time series data rather than economic or financial ones.

All necessary libraries have been imported at the top of the file. First we create a date object for Christmas 2021.

christmas_2021 = dt.date(2021, 12, 25)

christmas_2021 is not a string; it is a date object.

print(christmas_2021)
type(christmas_2021)
2021-12-25
datetime.date

We can check what methods and attributes it has; we filter out the magic (dunder) methods with a loop.

def filter_dunder(any_obj):
    """Return a DataFrame of an object's non-dunder methods and attributes."""
    temp_list = dir(any_obj)
    date_obj_meth_attr = []
    for i in temp_list:
        if i[0:2] != "__":  # skip magic (dunder) names
            date_obj_meth_attr.append(i)
    date_obj_meth_attr = {"meth_attr": date_obj_meth_attr}
    return pd.DataFrame(date_obj_meth_attr)
date_obj_meth_attr = filter_dunder(christmas_2021)
date_obj_meth_attr
meth_attr
0 ctime
1 day
2 fromisocalendar
3 fromisoformat
4 fromordinal
5 fromtimestamp
6 isocalendar
7 isoformat
8 isoweekday
9 max
10 min
11 month
12 replace
13 resolution
14 strftime
15 timetuple
16 today
17 toordinal
18 weekday
19 year

Obviously year, month, day, etc. are attributes.

print(christmas_2021.year)
print(christmas_2021.month)
print(christmas_2021.day)
2021
12
25

However, weekday() is a method, which returns the day of the week, where Monday \(=0\).

christmas_2021.weekday()
5
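For comparison, isoweekday() follows the ISO convention where Monday \(=1\) and Sunday \(=7\), so Saturday gives \(6\).

christmas_2021.isoweekday()
6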

isocalendar() returns the ISO year, the week number, and the day of the week.

christmas_2021.isocalendar()
datetime.IsoCalendarDate(year=2021, week=51, weekday=6)

Subtracting Days

Create another date object for 2021-11-11, which is Polish Independence Day (PID), then calculate how many days lie between PID and Christmas.

inde_day_2021 = dt.date(2021, 11, 11)  # Polish Independence day
date_delta = christmas_2021 - inde_day_2021
print(date_delta)
type(date_delta)
44 days, 0:00:00
datetime.timedelta

Note that the result is a timedelta object; it can be created manually or produced automatically by subtracting one date object from another.

delta_obj_meth_attr = filter_dunder(date_delta)
delta_obj_meth_attr
meth_attr
0 days
1 max
2 microseconds
3 min
4 resolution
5 seconds
6 total_seconds

Check how many seconds there are between the two dates.

date_delta.total_seconds()
3801600.0

Or days.

date_delta.days
44

A timedelta can also be specified by hand. Here we compute the date one week before PID.

offset = dt.timedelta(weeks=1)
inde_day_2021 - offset
datetime.date(2021, 11, 4)
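A timedelta can combine several components at once, e.g. days and hours; a quick sketch printing such an offset:

offset = dt.timedelta(days=2, hours=3)
print(offset)
2 days, 3:00:00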

Datetime Objects

A datetime object grants us a detailed timestamp down to the second. For instance, we create a datetime object just before midnight on Christmas, then convert it to a string with the .isoformat() method.

christmas_2021_midnight = dt.datetime(2021, 12, 25, 23, 59, 59)
christmas_2021_midnight.isoformat()
'2021-12-25T23:59:59'
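A datetime can also be assembled from an existing date plus a time of day with datetime.combine(); a minimal sketch reusing christmas_2021:

dt.datetime.combine(christmas_2021, dt.time(23, 59, 59))
datetime.datetime(2021, 12, 25, 23, 59, 59)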

datetime is more flexible and has many more methods and attributes.

filter_dunder(christmas_2021_midnight)
meth_attr
0 astimezone
1 combine
2 ctime
3 date
4 day
5 dst
6 fold
7 fromisocalendar
8 fromisoformat
9 fromordinal
10 fromtimestamp
11 hour
12 isocalendar
13 isoformat
14 isoweekday
15 max
16 microsecond
17 min
18 minute
19 month
20 now
21 replace
22 resolution
23 second
24 strftime
25 strptime
26 time
27 timestamp
28 timetuple
29 timetz
30 today
31 toordinal
32 tzinfo
33 tzname
34 utcfromtimestamp
35 utcnow
36 utcoffset
37 utctimetuple
38 weekday
39 year

.replace() returns a copy with the chosen attributes modified, which we can assign to a new object.

christmas_eve_2022_noon = christmas_2021_midnight.replace(
    year=2022, month=12, day=24, hour=11, minute=59, second=59
)
print(christmas_eve_2022_noon)
2022-12-24 11:59:59
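Note that .replace() does not modify the object in place; the original datetime is unchanged.

print(christmas_2021_midnight)
2021-12-25 23:59:59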

Here are the most common attributes of a datetime object.

print(christmas_2021_midnight.year)
print(christmas_2021_midnight.month)
print(christmas_2021_midnight.day)
print(christmas_2021_midnight.hour)
print(christmas_2021_midnight.minute)
print(christmas_2021_midnight.second)
2021
12
25
23
59
59

Formats

.strptime() turns strings into datetime objects (p means parse); .strftime() turns datetime objects into strings (f means format).

christmas_2021.isoformat()  # ordinary ISO format
'2021-12-25'

Check the format codes table in the Python documentation to find the proper directives.

christmas_2021.strftime("%d-%m-%Y")
'25-12-2021'
christmas_2021.strftime("%d-%B-%Y")
'25-December-2021'

Use whatever separator you prefer; what matters is the designated symbol right after %.

christmas_2021.strftime("%Y  ~*~+#  %m  ~*~#&*  %d")
'2021  ~*~+#  12  ~*~#&*  25'
christmas_2021.strftime("%Y, the number of day %j")
'2021, the number of day 359'
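The reverse direction works with strptime(). A small sketch parsing one of the strings above back into a date object (month names assume the default English locale):

dt.datetime.strptime("25-December-2021", "%d-%B-%Y").date()
datetime.date(2021, 12, 25)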

Load Data

Now we will try a dataset from Kaggle. It records the time, duration, and location of trips from a bike-sharing business.

df_bike = pd.read_csv("../dataset/metro-bike-share-trip-data.csv")
df_bike.head(3)
Trip ID Duration Start Time End Time Starting Station ID Starting Station Latitude Starting Station Longitude Ending Station ID Ending Station Latitude Ending Station Longitude ... Trip Route Category Passholder Type Starting Lat-Long Ending Lat-Long Neighborhood Councils (Certified) Council Districts Zip Codes LA Specific Plans Precinct Boundaries Census Tracts
0 18222186 15060 2017-01-19T17:05:00.000 2017-01-19T21:16:00.000 3031.0 34.044701 -118.252441 3000.0 NaN NaN ... One Way Walk-up NaN NaN NaN NaN NaN NaN NaN NaN
1 9518671 77160 2016-10-09T14:37:00.000 2016-10-10T12:03:00.000 NaN NaN NaN 3000.0 NaN NaN ... One Way Monthly Pass NaN NaN NaN NaN NaN NaN NaN NaN
2 20444932 86400 2017-02-18T10:15:00.000 2017-02-20T15:20:00.000 3026.0 34.063179 -118.245880 3000.0 NaN NaN ... One Way Walk-up NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 22 columns

Take a look at the data type of Start Time and End Time.

type(df_bike["Start Time"][0])
str

They are strings; Python won't interpret them as dates, so they must be converted to datetime objects before any further analysis.

Parsing Strings As Datetime Objects

By converting, we actually mean parsing. First we define a format string that exactly matches the format of the strings in the dataframe.

Then try an example by picking the first element of Start Time.

fmt = "%Y-%m-%dT%H:%M:%S.%f"
s = df_bike["Start Time"][0]

Parse it into datetime object.

d = dt.datetime.strptime(s, fmt)
d
datetime.datetime(2017, 1, 19, 17, 5)

As you can see, we have parsed the string into a datetime object. To parse the whole column, the easier way is to apply a lambda function to the whole column of strings.

df_bike["StartTime_Parsed"] = df_bike["Start Time"].apply(
    lambda x: dt.datetime.strptime(x, fmt)
)
df_bike["EndTime_Parsed"] = df_bike["End Time"].apply(
    lambda x: dt.datetime.strptime(x, fmt)
)

The newly added columns are at the far right.

df_bike.head(3)
Trip ID Duration Start Time End Time Starting Station ID Starting Station Latitude Starting Station Longitude Ending Station ID Ending Station Latitude Ending Station Longitude ... Starting Lat-Long Ending Lat-Long Neighborhood Councils (Certified) Council Districts Zip Codes LA Specific Plans Precinct Boundaries Census Tracts StartTime_Parsed EndTime_Parsed
0 18222186 15060 2017-01-19T17:05:00.000 2017-01-19T21:16:00.000 3031.0 34.044701 -118.252441 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 2017-01-19 17:05:00 2017-01-19 21:16:00
1 9518671 77160 2016-10-09T14:37:00.000 2016-10-10T12:03:00.000 NaN NaN NaN 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 2016-10-09 14:37:00 2016-10-10 12:03:00
2 20444932 86400 2017-02-18T10:15:00.000 2017-02-20T15:20:00.000 3026.0 34.063179 -118.245880 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 2017-02-18 10:15:00 2017-02-20 15:20:00

3 rows × 24 columns

Duration can be conveniently computed by subtraction.

df_bike["duration"] = df_bike["EndTime_Parsed"] - df_bike["StartTime_Parsed"]
df_bike.head()
Trip ID Duration Start Time End Time Starting Station ID Starting Station Latitude Starting Station Longitude Ending Station ID Ending Station Latitude Ending Station Longitude ... Ending Lat-Long Neighborhood Councils (Certified) Council Districts Zip Codes LA Specific Plans Precinct Boundaries Census Tracts StartTime_Parsed EndTime_Parsed duration
0 18222186 15060 2017-01-19T17:05:00.000 2017-01-19T21:16:00.000 3031.0 34.044701 -118.252441 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2017-01-19 17:05:00 2017-01-19 21:16:00 0 days 04:11:00
1 9518671 77160 2016-10-09T14:37:00.000 2016-10-10T12:03:00.000 NaN NaN NaN 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2016-10-09 14:37:00 2016-10-10 12:03:00 0 days 21:26:00
2 20444932 86400 2017-02-18T10:15:00.000 2017-02-20T15:20:00.000 3026.0 34.063179 -118.245880 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2017-02-18 10:15:00 2017-02-20 15:20:00 2 days 05:05:00
3 20905031 18840 2017-02-27T12:26:00.000 2017-02-27T17:40:00.000 3023.0 34.050911 -118.240967 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2017-02-27 12:26:00 2017-02-27 17:40:00 0 days 05:14:00
4 21031476 86400 2017-02-27T20:26:00.000 2017-03-01T09:49:00.000 3008.0 34.046612 -118.262733 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2017-02-27 20:26:00 2017-03-01 09:49:00 1 days 13:23:00

5 rows × 25 columns

print("Minimum duration: " + str(min(df_bike["duration"])))
print("Maximum duration: " + str(max(df_bike["duration"])))
Minimum duration: 0 days 00:01:00
Maximum duration: 5 days 16:28:00

To show the total seconds of each duration.

df_bike["duration"].dt.total_seconds()
0          15060.0
1          77160.0
2         191100.0
3          18840.0
4         134580.0
            ...   
132422      2520.0
132423      1080.0
132424      1500.0
132425       780.0
132426       300.0
Name: duration, Length: 132427, dtype: float64

Compute the mean duration.

df_bike["duration"].mean()
Timedelta('0 days 00:28:13.423848610')
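The mean is itself a Timedelta, so it can be converted to a plain number of seconds with total_seconds() when that is easier to compare:

df_bike["duration"].mean().total_seconds()
1693.42384861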

Unix Timestamp

The Unix timestamp is the number of seconds elapsed since January 1st, 1970 at UTC, a reference point known as the Unix epoch.

Take a look at the example below to grasp the meaning of a Unix timestamp.

timestamps = [0, 1, 2, 3000000, 100000000, 1600000000, 2073658356]

for ts in timestamps:
    print(dt.datetime.fromtimestamp(ts))
1970-01-01 00:00:00
1970-01-01 00:00:01
1970-01-01 00:00:02
1970-02-04 17:20:00
1973-03-03 09:46:40
2020-09-13 12:26:40
2035-09-17 16:12:36
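The reverse conversion is done with the .timestamp() method. A sketch using a UTC-aware datetime, so the result does not depend on the machine's local time zone:

moment = dt.datetime(2020, 9, 13, 12, 26, 40, tzinfo=dt.timezone.utc)  # aware, so no local-time dependence
moment.timestamp()
1600000000.0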

Time Zone And UTC Offset

UTC stands for Coordinated Universal Time, the reference used to regulate the world's clocks. UK winter time is UTC+0; the difference between a local time and UTC is called the UTC offset. For instance, the UTC offset of Poland in winter is +01:00, which means Poland is one hour ahead of the UK in the winter.

You can specify the time zone info for a datetime object; here we use UTC.

national_day_event_UK_2022 = dt.datetime(
    2022, 4, 23, 11, 14, 15, tzinfo=dt.timezone.utc
)

So the UTC offset is naturally \(+00:00\).

national_day_event_UK_2022.isoformat()
'2022-04-23T11:14:15+00:00'

Create a timezone for Eastern European Time, UTC+2, which is the winter time in Helsinki/Tallinn. The method is to create a timedelta, then use it as the UTC offset by passing tzinfo=eet.

eet = dt.timezone(dt.timedelta(hours=2))
national_day_event_UK_2022 = dt.datetime(2022, 4, 23, 18, 14, 15, tzinfo=eet)
national_day_event_UK_2022.isoformat()
'2022-04-23T18:14:15+02:00'

If the time zone or UTC offset was a mistake, replace it with a new one. Here I just print the result rather than assigning it to a new datetime object.

beijing_time = dt.timezone(dt.timedelta(hours=8))
print(national_day_event_UK_2022.replace(tzinfo=beijing_time))
2022-04-23 18:14:15+08:00

Display the same moment in another time zone; astimezone() converts the wall-clock time while preserving the underlying instant.

national_day_event_UK_2022.astimezone(dt.timezone.utc).isoformat()
'2022-04-23T16:14:15+00:00'

IANA Timezone

If you know the location, e.g. a city name, you can use an IANA time zone name directly, such as Europe/Paris or Asia/Chungking.

We will use the dateutil library, which has been imported at the top of this file.

ck_time = du.tz.gettz("Asia/Chungking")
national_day_event_UK_2022.astimezone(ck_time).isoformat()
'2022-04-24T00:14:15+08:00'
ny_time = du.tz.gettz("America/New_York")
national_day_event_UK_2022.astimezone(ny_time).isoformat()
'2022-04-23T12:14:15-04:00'
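Since Python 3.9 the standard library also provides IANA time zones through the zoneinfo module; a minimal sketch of the same conversion, assuming the time zone database is available on your system:

from zoneinfo import ZoneInfo  # Python 3.9+; needs system tzdata (or the tzdata package)

national_day_event_UK_2022.astimezone(ZoneInfo("America/New_York")).isoformat()
'2022-04-23T12:14:15-04:00'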

Winter Time Change

Seasonal time changes are common practice in the European Union. For instance, Poland switched to winter time at 3:00 AM on 31 Oct 2021, meaning the clock was turned back to 2:00 AM when it reached 3:00 AM. We will take a look at the UTC offset before and after the change.

before = dt.datetime(2021, 10, 31, 1, 0, 0, tzinfo=du.tz.gettz("Europe/Warsaw"))
after = before + dt.timedelta(hours=6)
print(before.isoformat() + " to " + after.isoformat())
2021-10-31T01:00:00+02:00 to 2021-10-31T07:00:00+01:00

As you can see, the UTC offset changed from +02:00 to +01:00 when we added \(6\) hours across the time change. However, the change does not seem to be reflected in the addition itself: \(6\) truly elapsed hours from T01:00:00 should land on T06:00:00 local time, because of the one-hour shift backward.

We can verify how many hours Python assumes have elapsed.

(after - before).seconds / (60**2)
6.0

If we use UTC as the reference time zone, the true elapsed time turns out to be \(7\) hours.

utc_delta = after.astimezone(dt.timezone.utc) - before.astimezone(dt.timezone.utc)
utc_delta.seconds / (60**2)
7.0
after.astimezone(dt.timezone.utc)
datetime.datetime(2021, 10, 31, 6, 0, tzinfo=datetime.timezone.utc)
before.astimezone(dt.timezone.utc)
datetime.datetime(2021, 10, 30, 23, 0, tzinfo=datetime.timezone.utc)

Now use a loop to show how the UTC offset on 31 October varies across the years: in the years when 31 October is the last Sunday of October (the change day), 01:00 is still summer time (+02:00).

end_Oct = dt.datetime(2000, 10, 31, 1, 0, 0, tzinfo=du.tz.gettz("Europe/Warsaw"))
for y in range(2000, 2050):
    print(end_Oct.replace(year=y).isoformat())
2000-10-31T01:00:00+01:00
2001-10-31T01:00:00+01:00
2002-10-31T01:00:00+01:00
2003-10-31T01:00:00+01:00
2004-10-31T01:00:00+02:00
2005-10-31T01:00:00+01:00
2006-10-31T01:00:00+01:00
2007-10-31T01:00:00+01:00
2008-10-31T01:00:00+01:00
2009-10-31T01:00:00+01:00
2010-10-31T01:00:00+02:00
2011-10-31T01:00:00+01:00
2012-10-31T01:00:00+01:00
2013-10-31T01:00:00+01:00
2014-10-31T01:00:00+01:00
2015-10-31T01:00:00+01:00
2016-10-31T01:00:00+01:00
2017-10-31T01:00:00+01:00
2018-10-31T01:00:00+01:00
2019-10-31T01:00:00+01:00
2020-10-31T01:00:00+01:00
2021-10-31T01:00:00+02:00
2022-10-31T01:00:00+01:00
2023-10-31T01:00:00+01:00
2024-10-31T01:00:00+01:00
2025-10-31T01:00:00+01:00
2026-10-31T01:00:00+01:00
2027-10-31T01:00:00+02:00
2028-10-31T01:00:00+01:00
2029-10-31T01:00:00+01:00
2030-10-31T01:00:00+01:00
2031-10-31T01:00:00+01:00
2032-10-31T01:00:00+02:00
2033-10-31T01:00:00+01:00
2034-10-31T01:00:00+01:00
2035-10-31T01:00:00+01:00
2036-10-31T01:00:00+01:00
2037-10-31T01:00:00+01:00
2038-10-31T01:00:00+01:00
2039-10-31T01:00:00+01:00
2040-10-31T01:00:00+01:00
2041-10-31T01:00:00+01:00
2042-10-31T01:00:00+01:00
2043-10-31T01:00:00+01:00
2044-10-31T01:00:00+01:00
2045-10-31T01:00:00+01:00
2046-10-31T01:00:00+01:00
2047-10-31T01:00:00+01:00
2048-10-31T01:00:00+01:00
2049-10-31T01:00:00+01:00

Ambiguous Datetime

Ambiguous times happen when there is a time change: certain wall-clock times occur twice. du.tz.datetime_ambiguous() can test for this.

dt_1 = dt.datetime(2020, 10, 31, 2, 35, 10, tzinfo=du.tz.gettz("Europe/Warsaw"))
du.tz.datetime_ambiguous(dt_1)
False

At 3 AM on 31 Oct 2021, the clock in Poland was dialed back one hour to 2 AM. So any wall-clock time between 2 AM and 3 AM occurs twice; this is what we call an ambiguous time.

dt_2 = dt.datetime(2021, 10, 31, 2, 35, 10, tzinfo=du.tz.gettz("Europe/Warsaw"))
du.tz.datetime_ambiguous(dt_2)
True
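The ambiguity can be resolved with the fold attribute (PEP 495): fold=0 selects the first occurrence (still summer time), fold=1 the second one after the change. A sketch, assuming the dateutil time zone honours fold:

print(dt_2.replace(fold=0).utcoffset())  # first occurrence, +02:00
print(dt_2.replace(fold=1).utcoffset())  # second occurrence, +01:00
2:00:00
1:00:00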

Parsing Datetime With Pandas

Load data again and check the info summary.

You can see the Start Time and End Time are strings (Dtype: object).

df_bike = pd.read_csv("../dataset/metro-bike-share-trip-data.csv")
df_bike.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132427 entries, 0 to 132426
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Trip ID                            132427 non-null  int64  
 1   Duration                           132427 non-null  int64  
 2   Start Time                         132427 non-null  object 
 3   End Time                           132427 non-null  object 
 4   Starting Station ID                132408 non-null  float64
 5   Starting Station Latitude          132379 non-null  float64
 6   Starting Station Longitude         132379 non-null  float64
 7   Ending Station ID                  132331 non-null  float64
 8   Ending Station Latitude            131376 non-null  float64
 9   Ending Station Longitude           131376 non-null  float64
 10  Bike ID                            132417 non-null  float64
 11  Plan Duration                      131661 non-null  float64
 12  Trip Route Category                132427 non-null  object 
 13  Passholder Type                    132427 non-null  object 
 14  Starting Lat-Long                  98622 non-null   object 
 15  Ending Lat-Long                    131376 non-null  object 
 16  Neighborhood Councils (Certified)  98590 non-null   float64
 17  Council Districts                  98590 non-null   float64
 18  Zip Codes                          98590 non-null   float64
 19  LA Specific Plans                  15803 non-null   float64
 20  Precinct Boundaries                131328 non-null  float64
 21  Census Tracts                      131328 non-null  float64
dtypes: float64(14), int64(2), object(6)
memory usage: 22.2+ MB

We can parse the dates directly with parse_dates; the info summary then shows they are converted to datetime64.

df_bike = pd.read_csv(
    "../dataset/metro-bike-share-trip-data.csv", parse_dates=["Start Time", "End Time"]
)
df_bike.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132427 entries, 0 to 132426
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype         
---  ------                             --------------   -----         
 0   Trip ID                            132427 non-null  int64         
 1   Duration                           132427 non-null  int64         
 2   Start Time                         132427 non-null  datetime64[ns]
 3   End Time                           132427 non-null  datetime64[ns]
 4   Starting Station ID                132408 non-null  float64       
 5   Starting Station Latitude          132379 non-null  float64       
 6   Starting Station Longitude         132379 non-null  float64       
 7   Ending Station ID                  132331 non-null  float64       
 8   Ending Station Latitude            131376 non-null  float64       
 9   Ending Station Longitude           131376 non-null  float64       
 10  Bike ID                            132417 non-null  float64       
 11  Plan Duration                      131661 non-null  float64       
 12  Trip Route Category                132427 non-null  object        
 13  Passholder Type                    132427 non-null  object        
 14  Starting Lat-Long                  98622 non-null   object        
 15  Ending Lat-Long                    131376 non-null  object        
 16  Neighborhood Councils (Certified)  98590 non-null   float64       
 17  Council Districts                  98590 non-null   float64       
 18  Zip Codes                          98590 non-null   float64       
 19  LA Specific Plans                  15803 non-null   float64       
 20  Precinct Boundaries                131328 non-null  float64       
 21  Census Tracts                      131328 non-null  float64       
dtypes: datetime64[ns](2), float64(14), int64(2), object(4)
memory usage: 22.2+ MB

Another method is to use the pd.to_datetime function.

df_bike = pd.read_csv("../dataset/metro-bike-share-trip-data.csv")
df_bike["Start Time"] = pd.to_datetime(
    df_bike["Start Time"], format="%Y-%m-%dT%H:%M:%S.%f"
)
df_bike.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132427 entries, 0 to 132426
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype         
---  ------                             --------------   -----         
 0   Trip ID                            132427 non-null  int64         
 1   Duration                           132427 non-null  int64         
 2   Start Time                         132427 non-null  datetime64[ns]
 3   End Time                           132427 non-null  object        
 4   Starting Station ID                132408 non-null  float64       
 5   Starting Station Latitude          132379 non-null  float64       
 6   Starting Station Longitude         132379 non-null  float64       
 7   Ending Station ID                  132331 non-null  float64       
 8   Ending Station Latitude            131376 non-null  float64       
 9   Ending Station Longitude           131376 non-null  float64       
 10  Bike ID                            132417 non-null  float64       
 11  Plan Duration                      131661 non-null  float64       
 12  Trip Route Category                132427 non-null  object        
 13  Passholder Type                    132427 non-null  object        
 14  Starting Lat-Long                  98622 non-null   object        
 15  Ending Lat-Long                    131376 non-null  object        
 16  Neighborhood Councils (Certified)  98590 non-null   float64       
 17  Council Districts                  98590 non-null   float64       
 18  Zip Codes                          98590 non-null   float64       
 19  LA Specific Plans                  15803 non-null   float64       
 20  Precinct Boundaries                131328 non-null  float64       
 21  Census Tracts                      131328 non-null  float64       
dtypes: datetime64[ns](1), float64(14), int64(2), object(5)
memory usage: 22.2+ MB

Pandas can also set the time zone with tz_localize().

df_bike["Start Time"].head(3).dt.tz_localize("Europe/Helsinki")
0   2017-01-19 17:05:00+02:00
1   2016-10-09 14:37:00+03:00
2   2017-02-18 10:15:00+02:00
Name: Start Time, dtype: datetime64[ns, Europe/Helsinki]
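Once localized, a tz-aware column can be converted to another zone with tz_convert(); a quick sketch converting the same three rows to UTC:

df_bike["Start Time"].head(3).dt.tz_localize("Europe/Helsinki").dt.tz_convert("UTC")
0   2017-01-19 15:05:00+00:00
1   2016-10-09 11:37:00+00:00
2   2017-02-18 08:15:00+00:00
Name: Start Time, dtype: datetime64[ns, UTC]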